Evaluation of Deep Convolutional Nets for Document Image Classification and Retrieval
This paper presents a new state-of-the-art for document image classification
and retrieval, using features learned by deep convolutional neural networks
(CNNs). In object and scene analysis, deep neural nets are capable of learning
a hierarchical chain of abstraction from pixel inputs to concise and
descriptive representations. The current work explores this capacity in the
realm of document analysis, and confirms that this representation strategy is
superior to a variety of popular hand-crafted alternatives. Experiments also
show that (i) features extracted from CNNs are robust to compression, (ii) CNNs
trained on non-document images transfer well to document analysis tasks, and
(iii) enforcing region-specific feature-learning is unnecessary given
sufficient training data. This work also makes available a new labelled subset
of the IIT-CDIP collection, containing 400,000 document images across 16
categories, useful for training new CNNs for document analysis.
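
As a companion illustration (not the paper's exact pipeline), the transfer idea described above can be sketched as a CNN pretrained on non-document images serving as a frozen feature extractor, with a lightweight classifier trained on top of the extracted features. The ResNet-50 backbone, 224x224 input size, and 16-way linear head below are illustrative assumptions only.

    # Minimal sketch: frozen ImageNet-pretrained CNN features + linear classifier
    # for document images. Backbone, input size, and class count are assumptions.
    import torch
    import torch.nn as nn
    from torchvision import models, transforms
    from PIL import Image

    # Pretrained backbone with its classification head removed.
    backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
    backbone.fc = nn.Identity()   # expose the 2048-d pooled features
    backbone.eval()

    preprocess = transforms.Compose([
        transforms.Grayscale(num_output_channels=3),  # scans are often 1-channel
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])

    def extract_features(path: str) -> torch.Tensor:
        """Return a fixed-length descriptor for one document image."""
        img = preprocess(Image.open(path)).unsqueeze(0)
        with torch.no_grad():
            return backbone(img).squeeze(0)   # shape: (2048,)

    # A linear classifier over the frozen features, e.g. 16 document categories.
    classifier = nn.Linear(2048, 16)
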
Segmentation-Aware Convolutional Networks Using Local Attention Masks
We introduce an approach to integrate segmentation information within a
convolutional neural network (CNN). This counteracts the tendency of CNNs to
smooth information across regions and increases their spatial precision. To
obtain segmentation information, we set up a CNN to provide an embedding space
where region co-membership can be estimated based on Euclidean distance. We use
these embeddings to compute a local attention mask relative to every neuron
position. We incorporate such masks in CNNs and replace the convolution
operation with a "segmentation-aware" variant that allows a neuron to
selectively attend to inputs coming from its own region. We call the resulting
network a segmentation-aware CNN because it adapts its filters at each image
point according to local segmentation cues. We demonstrate the merit of our
method on two widely different dense prediction tasks that involve
classification (semantic segmentation) and regression (optical flow). Our
results show that in semantic segmentation we can match the performance of
DenseCRFs while being faster and simpler, and in optical flow we obtain clearly
sharper responses than networks that do not use local attention masks. In both
cases, segmentation-aware convolution yields systematic improvements over
strong baselines. Source code for this work is available online at
http://cs.cmu.edu/~aharley/segaware
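
A minimal sketch, under simplifying assumptions, of the masking idea described above: each neighbor in a 3x3 convolution window is weighted by exp(-lambda * ||e_i - e_j||), where e are per-pixel embeddings, and the weights are normalized before the ordinary filter is applied. Tensor layouts and the embedding network are illustrative, not the released code.

    # Sketch of a "segmentation-aware" convolution using local attention masks
    # derived from embedding distances. Shapes and hyperparameters are assumptions.
    import torch
    import torch.nn.functional as F

    def segmentation_aware_conv(x, embeddings, weight, bias=None, k=3, lam=1.0):
        """x: (B, C, H, W) features; embeddings: (B, D, H, W); weight: (Cout, C, k, k)."""
        B, C, H, W = x.shape
        pad = k // 2

        # Unfold features and embeddings into k*k neighborhoods per position.
        x_unf = F.unfold(x, k, padding=pad).view(B, C, k * k, H * W)
        e_unf = F.unfold(embeddings, k, padding=pad).view(B, -1, k * k, H * W)
        e_center = embeddings.reshape(B, -1, 1, H * W)

        # Local attention mask from embedding distances, normalized per window.
        dist = (e_unf - e_center).norm(dim=1)              # (B, k*k, H*W)
        mask = torch.exp(-lam * dist)
        mask = mask / (mask.sum(dim=1, keepdim=True) + 1e-6)

        # Apply the mask to the unfolded inputs, then the ordinary filter weights.
        x_masked = (x_unf * mask.unsqueeze(1)).view(B, C * k * k, H * W)
        w = weight.view(weight.size(0), -1)                # (Cout, C*k*k)
        out = torch.einsum('oc,bcl->bol', w, x_masked).view(B, -1, H, W)
        return out if bias is None else out + bias.view(1, -1, 1, 1)
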
Particle Videos Revisited: Tracking Through Occlusions Using Point Trajectories
Tracking pixels in videos is typically studied as an optical flow estimation
problem, where every pixel is described with a displacement vector that locates
it in the next frame. Even though wider temporal context is freely available,
prior efforts to take this into account have yielded only small gains over
2-frame methods. In this paper, we revisit Sand and Teller's "particle video"
approach, and study pixel tracking as a long-range motion estimation problem,
where every pixel is described with a trajectory that locates it in multiple
future frames. We re-build this classic approach using components that drive
the current state-of-the-art in flow and object tracking, such as dense cost
maps, iterative optimization, and learned appearance updates. We train our
models using long-range amodal point trajectories mined from existing optical
flow datasets that we synthetically augment with occlusions. We test our
approach in trajectory estimation benchmarks and in keypoint label propagation
tasks, and compare favorably against state-of-the-art optical flow and feature
tracking methods.
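
One ingredient mentioned above, the dense cost map, can be sketched in a heavily simplified form (not the paper's architecture): correlate a tracked point's feature descriptor against the feature map of a later frame, then read out a position with a soft-argmax; the feature extractor and the learned iterative updates are omitted.

    # Simplified sketch: dense cost map + soft-argmax position readout.
    # Feature extraction and trajectory refinement are out of scope here.
    import torch
    import torch.nn.functional as F

    def cost_map(query_feat, frame_feats):
        """query_feat: (C,) descriptor of the tracked point; frame_feats: (C, H, W)."""
        C, H, W = frame_feats.shape
        scores = torch.einsum('c,chw->hw', query_feat, frame_feats) / C ** 0.5
        return scores  # (H, W) similarity map for this frame

    def soft_argmax(scores, temperature=1.0):
        """Differentiable position estimate from a cost map."""
        H, W = scores.shape
        probs = F.softmax(scores.flatten() / temperature, dim=0).view(H, W)
        ys = torch.arange(H, dtype=scores.dtype)
        xs = torch.arange(W, dtype=scores.dtype)
        y = (probs.sum(dim=1) * ys).sum()
        x = (probs.sum(dim=0) * xs).sum()
        return torch.stack([x, y])  # estimated (x, y) of the point in this frame
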
Simple-BEV: What Really Matters for Multi-Sensor BEV Perception?
Building 3D perception systems for autonomous vehicles that do not rely on
high-density LiDAR is a critical research problem because of the expense of
LiDAR systems compared to cameras and other sensors. Recent research has
developed a variety of camera-only methods, where features are differentiably
"lifted" from the multi-camera images onto the 2D ground plane, yielding a
"bird's eye view" (BEV) feature representation of the 3D space around the
vehicle. This line of work has produced a variety of novel "lifting" methods,
but we observe that other details in the training setups have shifted at the
same time, making it unclear what really matters in top-performing methods. We
also observe that using cameras alone is not a real-world constraint,
considering that additional sensors like radar have been integrated into real
vehicles for years already. In this paper, we first attempt to elucidate
the high-impact factors in the design and training protocol of BEV perception
models. We find that batch size and input resolution greatly affect
performance, while lifting strategies have a more modest effect -- even a
simple parameter-free lifter works well. Second, we demonstrate that radar data
can provide a substantial boost to performance, helping to close the gap
between camera-only and LiDAR-enabled systems. We analyze the radar usage
details that lead to good performance, and invite the community to reconsider
this commonly-neglected part of the sensor platform.
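
A parameter-free lifter of the kind referred to above can be sketched, under illustrative assumptions about coordinate conventions, as follows: project the 3D points of a BEV/voxel grid into each camera with its projection matrix, bilinearly sample image features at those locations, and average over the cameras that see each point. This is a sketch, not the released implementation.

    # Sketch of parameter-free camera-to-BEV lifting via projection + bilinear sampling.
    # Coordinate conventions and matrix layouts are assumptions for illustration.
    import torch
    import torch.nn.functional as F

    def lift_to_bev(img_feats, proj_mats, grid_xyz):
        """
        img_feats: (N_cam, C, Hf, Wf) per-camera feature maps
        proj_mats: (N_cam, 3, 4) projection matrices (intrinsics @ extrinsics),
                   assumed to map ego-frame points to pixel coords at feature scale
        grid_xyz:  (P, 3) 3D sample points of the BEV/voxel grid in the ego frame
        """
        N, C, Hf, Wf = img_feats.shape
        P = grid_xyz.shape[0]
        ones = torch.ones(P, 1, dtype=grid_xyz.dtype)
        pts_h = torch.cat([grid_xyz, ones], dim=1)            # (P, 4) homogeneous

        feats_sum = torch.zeros(C, P)
        valid_sum = torch.zeros(1, P)
        for cam in range(N):
            uvw = proj_mats[cam] @ pts_h.T                    # (3, P)
            z = uvw[2].clamp(min=1e-6)
            u, v = uvw[0] / z, uvw[1] / z                     # pixel coordinates
            in_front = uvw[2] > 0
            # Normalize to [-1, 1] for grid_sample and sample the features.
            grid = torch.stack([2 * u / (Wf - 1) - 1,
                                2 * v / (Hf - 1) - 1], dim=-1)  # (P, 2)
            sampled = F.grid_sample(img_feats[cam:cam + 1],
                                    grid.view(1, 1, P, 2),
                                    align_corners=True)          # (1, C, 1, P)
            valid = (in_front & (grid.abs() <= 1).all(dim=-1)).float()
            feats_sum += sampled.view(C, P) * valid
            valid_sum += valid
        return feats_sum / valid_sum.clamp(min=1)             # (C, P) lifted features
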
- …